Study on Volleyball-Movement Pose Recognition Based on Joint Point Sequence

As modern science develops and the pace of society quickens, people's quality of life continues to improve. People are increasingly concerned with their quality of life, pay attention to body management, and strengthen physical exercise. Volleyball is a sport loved by many people. Studying volleyball postures and recognizing and detecting them can provide theoretical guidance and suggestions for players. When applied to competitions, it can also help judges make fair and reasonable decisions. At present, pose recognition in ball sports is challenging in terms of both action complexity and research data, and it has important application value. Therefore, building on an analysis and summary of existing human pose recognition studies based on joint point sequences and long short-term memory (LSTM), this article studies human volleyball pose recognition. It proposes a data preprocessing method based on angle and relative distance feature enhancement and a ball-motion pose recognition model based on LSTM-Attention. The experimental results show that the proposed data preprocessing method can further improve the accuracy of gesture recognition. For example, the joint point coordinate information after the coordinate system transformation significantly improves the recognition accuracy of the five ball-motion poses by at least 0.01. In addition, the LSTM-Attention recognition model is soundly designed in structure and competitive in gesture recognition performance.


Introduction
The posture of the human body is one of its important biological characteristics. It has many application scenarios, such as gait analysis, video surveillance, augmented reality, human-computer interaction, finance, mobile payment, entertainment and games, and sports science. Gesture recognition allows computers to know what a person is doing and who they are. In the field of monitoring in particular, it is a good solution when the resolution of the face image obtained by the camera is too low. It can also serve as an important auxiliary verification method in a target identification system to reduce the effect of misidentification. Human gesture recognition includes action recognition and identity recognition, and the key lies in human feature extraction, which mainly consists of action feature extraction and identity feature extraction. In volleyball, detecting and identifying relevant poses in motion sequences can not only provide coaches and players with data-based guidance and advice but also help referees make fair and reasonable decisions in various games. However, most existing human gesture recognition methods target simple daily actions [1], for which data collection is simple, so further research is needed for more complex actions.
Most traditional human pose recognition research is carried out on sequences of video frames [2]. Although some results have been achieved, it is difficult to break through the bottleneck of video-based human pose recognition because of changes in light intensity, interference from complex backgrounds, and self-occlusion of target users. In recent years, with the rapid development of video capture technology such as Kinect, researchers can easily obtain images, depth images, and coordinate information on skeletal joint points [3,4]. Compared with ordinary images, the information provided by depth images [5] reflects the three-dimensional structure and geometry of target objects well. Moreover, it is robust to factors such as light intensity and scale changes. Compared with simple daily postures, the postures of ball sports are more complex and also place requirements on the research data. Besides, because gesture recognition becomes more difficult, existing recognition methods cannot effectively judge ball-movement gestures. Therefore, further research on pose data and recognition methods is needed to improve the accuracy of ball-motion pose classification.
This article takes volleyball as the representative ball sport. The problem of volleyball gesture recognition is studied by building on an analysis and summary of existing research on human gesture recognition based on the joint point sequence and the long short-term memory (LSTM) network [6]. In addition, this article proposes a data preprocessing method based on angle and relative distance feature enhancement and a volleyball-motion pose recognition model based on LSTM-Attention. Experimental results show that the data preprocessing method reported here can further improve the accuracy of gesture recognition. Besides, the LSTM-Attention recognition model is soundly designed in structure and competitive in gesture recognition performance, which can provide a basis for research on gesture recognition in volleyball. However, this approach is not yet universal and should be refined in future studies.

Recurrent Neural Network (RNN)
The RNN is obtained by simulating the human neural transmission system [7]. When humans think, for example while reading an article, they may not be able to understand its meaning by relying only on the information currently being read. It is often necessary to combine the previous content to understand the essence of the article. This shows that humans do not start from a blank mind when they think about problems. The brain does not discard the content that has been read before but understands and analyses based on it. This reveals that human thinking is a continuous process [8]. Traditional neural networks cannot achieve this, which motivates the RNN. The RNN is a neural network with short-term memory that continuously transmits information by adding loops to the network. It is suitable for processing sequential data. Figure 1 shows the general structure of an RNN. In Figure 1, x_t is the input of the network at time t, and h_t is the value of the hidden layer at time t, which also serves as the output.
Each block of the neural network in Figure 1 is represented by A. Figure 2 shows the extended structure of the RNN, that is, its internal structure unrolled along the time dimension [9]. In the RNN, signals are transferred between all the hidden layer nodes [10]. The output of the hidden layer of the RNN at time t is fed back to the RNN at time t + 1. Then, the input of the RNN at time t + 1 and the output of the RNN at time t jointly determine the output of the RNN at time t + 1. The chain structure of the RNN gives it a strong processing capability for sequence data. In recent years, the RNN has performed very well in language modeling, speech recognition, image captioning, and machine translation.
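The recurrence described above can be illustrated with a minimal sketch of a plain RNN forward pass. The tanh activation, the weight names `w_xh`, `w_hh`, and `b_h`, and the toy numbers are assumptions made for illustration; the text does not specify them.

```python
import math

def rnn_forward(xs, w_xh, w_hh, b_h):
    """Run a plain tanh RNN over a sequence; return the hidden state at every step."""
    hidden_size = len(b_h)
    h = [0.0] * hidden_size  # h_0: empty memory before the first input
    states = []
    for x_t in xs:
        # The new state depends on the current input x_t and the previous state h.
        h = [
            math.tanh(
                sum(w * x for w, x in zip(w_xh[j], x_t))
                + sum(w * p for w, p in zip(w_hh[j], h))
                + b_h[j]
            )
            for j in range(hidden_size)
        ]
        states.append(h)
    return states

# Toy sequence of three 2-dimensional inputs (hypothetical values).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_xh = [[0.5, -0.5], [0.3, 0.8]]   # input-to-hidden weights
w_hh = [[0.1, 0.0], [0.0, 0.1]]    # hidden-to-hidden (feedback) weights
b_h = [0.0, 0.0]
states = rnn_forward(xs, w_xh, w_hh, b_h)
```

Because the same weights are reused at every step, the loop accepts sequences of any length, which is the property the chain structure provides.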
The number of units in the input layer of a neural network is fixed, so inputs of variable length must be processed in a loop or by recursion. The RNN implements the former. It works by dividing an input of variable length into small chunks of equal length, which are then input into the network sequentially, allowing the network to process variable-length input. The RNN can encode tree- or graph-structured information as a vector, mapping the information into a semantic vector space. This semantic vector space satisfies certain properties; for example, semantically similar vectors are close together. However, a major limitation of the RNN is the vanishing gradient problem [11]. The RNN is a short-term memory neural network that can only memorize short-distance information sequences. When the time interval becomes large, the RNN gradually loses its ability to learn information from earlier time steps, which makes training very difficult. This problem is also known as the "long sentence dependency problem."
Computational Intelligence and Neuroscience

LSTM.
To solve the "long sentence dependency problem" of the RNN, researchers improved its basic structure. They designed an internal cell state to record historical state information [12] and introduced gating units to control the node information of the hidden layer. This variant of the RNN, called the LSTM network, solves the above problem well and can memorize long-term information related to the current recognition task. LSTM can be regarded as a special type of RNN [13] that greatly enhances the network's ability to store information over long time intervals. LSTM and the RNN are both chain structures; the difference lies in the internal structure of the network. LSTM is one of the most effective sequence models in deep learning (DL) [14] and mainly consists of the forget gate, the input gate, and the output gate. The RNN model suffers from vanishing gradients. LSTM effectively avoids this drawback by introducing a new cell structure that decides which data to retain and which to forget. Data are processed in real time from the far left to the far right [15], starting from the input. Therefore, within the endless input information, it is necessary to judge which information continues to flow and which is abandoned. This process is governed by a switch control, f_t.
The control function is given in equation (1), in which w_f and b_f are the weight and bias of the forget gate, respectively, σ represents the activation function, h_{t-1} represents the short-term memory, and x_t represents the current input. The previous information is then input into the input gate. The task at this layer is to decide which information needs to be updated and how much to update.
In equations (2)-(4), w_i and w_c represent the corresponding weights, b_i and b_c represent the corresponding biases, and C_t represents the current cell state value. After the screening of the first two gates is completed, the output gate determines which information needs to be output. A switch in the output gate controls the output.
In equations (5) and (6), w_o and b_o represent the weight and bias of the output gate, o_t represents the output gating unit, h_t is the output value of the current unit, and σ represents the activation function.
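Equations (1)-(6) referred to in this section are not reproduced in this text. For orientation, the standard LSTM formulation that matches the symbols defined above is sketched below. The concatenated input [h_{t-1}, x_t], the input gate i_t, the candidate state \tilde{C}_t, the tanh nonlinearity, and the element-wise product \odot are taken from the standard formulation and are assumptions here; the authors' exact notation may differ.

```latex
\begin{align}
f_t &= \sigma\left(w_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1} \\
i_t &= \sigma\left(w_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2} \\
\tilde{C}_t &= \tanh\left(w_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{3} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4} \\
o_t &= \sigma\left(w_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{5} \\
h_t &= o_t \odot \tanh\left(C_t\right) \tag{6}
\end{align}
```

In this form, the forget gate f_t scales the previous cell state C_{t-1}, the input gate i_t scales the candidate update, and the output gate o_t decides how much of the cell state is exposed as h_t.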

Volleyball-Movement Pose Recognition Method Based on LSTM-Attention.
This section constructs a volleyball-movement posture recognition method based on LSTM-Attention to help LSTM effectively extract the feature information before and after the action [16]. It first gives a brief overview of the model and analyses the modules involved. The experimental results demonstrate the effectiveness of the proposed LSTM-Attention method in ball-motion pose recognition. In recent years, with the continuous development of DL-related technologies, DL-based methods have gradually replaced traditional methods in research on human gesture recognition. Although RNN-based gesture recognition has obvious advantages in short-term memory, it has great difficulty with recognition scenarios that require long-term memory. As a special type of RNN, LSTM not only solves the vanishing gradient problem of the RNN but also enhances the network's ability to memorize information over long time intervals. Many researchers therefore favor LSTM over the plain RNN because it is good at capturing feature information across long sequences. As the LSTM neural network has been studied in depth, many networks optimized on its basis have been produced, and they are widely used in text, speech, and image recognition. In research on human posture recognition, differences in human skeletal structure produce large differences in the data when people play volleyball. This can make it difficult to distinguish the same ball-game pose across target users. To address this problem, this article proposes a feature enhancement preprocessing method based on the angle and the relative distance. Besides, contextual information between action sequences plays a crucial role in gesture recognition. This section therefore constructs a ball-motion pose recognition method based on the LSTM-Attention model to effectively extract the long-sequence feature information of ball-motion poses.
The overall process of ball-sports gesture recognition is shown in Figure 3.
First, Kinect is used to acquire the joint point coordinate data of the human skeleton performing ball motion [17]. Then, the scale-invariant angle and relative distance features are extracted from the joint point information. Finally, the LSTM-Attention network is used to mine the deep temporal information in the skeleton sequence of the human body during ball sports. This information is combined with spatial features to recognize the ball-motion pose of the human body [18]. By combining LSTM with the attention mechanism, the model learns the correlation between the time series data of ball-motion poses autonomously and effectively, which improves its pose recognition accuracy.

Angle Feature Extraction.
When the human body performs ball sports, the features selected from the joint point information should be general [19]. They should not vary greatly with differences in human skeletal structure, and they should not shift because the target user and the Kinect depth sensor are in different positions. Therefore, angle features extracted from the joint point information are used to predict the pose category of the human body during ball sports [20]. When the human body performs different ball sports, the joint points of the skeleton form different angular relationships in space. For the joint points on the arm in particular, the included angle between the joint points involved in the corresponding action has a relatively fixed range of variation, which intuitively describes the ball-game posture. Therefore, the coordinate information of the eight joint points is decomposed into five parts according to the human body structure, including the trunk and the limbs. Then, angle feature extraction is performed on the joint-point coordinate information of these five parts.
When the joint angles of the human skeleton are calculated, the limbs of each part of the human skeleton model need to be regarded as vectors. The correspondence between the joint angles and their component vectors is shown in Table 1, where r_{i,j} represents a vector that forms the angle at a joint point.
In Figure 4, the right arm model of the human body is taken as an example [21]: the joint angle r_5 is formed by the two vectors r_{4,5} and r_{5,6}.
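As a hedged illustration of the angle feature, the sketch below computes the included angle at a joint from three 3D joint positions. The function name `joint_angle`, the choice to measure both limb vectors from the middle joint, and the sample coordinates are assumptions; the article does not state the vector directions or give the formula.

```python
import math

def joint_angle(p_prev, p_joint, p_next):
    """Included angle in degrees at p_joint between the two adjacent limbs."""
    # Both limb vectors start at the middle joint (an assumed convention).
    u = [a - b for a, b in zip(p_prev, p_joint)]
    v = [a - b for a, b in zip(p_next, p_joint)]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("coincident joints: a limb vector has zero length")
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Hypothetical right-arm joints (shoulder, elbow, wrist), in metres.
shoulder, elbow, wrist = (0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.25, 0.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

Because the angle depends only on the directions of the two limb vectors, it is unchanged when the whole skeleton is scaled or translated, which is why it is robust to differences in body size and sensor position.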

Relative Distance Feature Extraction.
The relative distance feature between human skeleton joint points is a kind of information that can express different ball-motion pose data in the spatial dimension [22]. When the human body performs a ball action, the spatial positions of the joint points of each part of the skeleton change. Moreover, the relative distances between some joint points change with the movement in a pattern that is characteristic of an action posture [23]. For example, when people perform a badminton swing, the relative distance between the user's hand and the base of the spine is highly expressive. Therefore, the relative distance feature extracted from joint point information is used here to analyse the pose category of the human body during ball sports.
Analysis of the joint point data recorded while the human body performs ball-sports postures shows that the joint point at the base of the spine remains stable during movement. Therefore, this article regards the joint point at the base of the spine as the center point and analyses the ball-sports posture based on the relative distance between the center point and the other joint points.
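A minimal sketch of the relative distance feature for one frame follows. The joint names (`spine_base`, `hand_right`, `head`), the dictionary layout, and the use of plain Euclidean distance are assumptions for illustration; the article does not specify the distance metric or any normalization.

```python
import math

def relative_distances(joints, center="spine_base"):
    """Euclidean distance from the center joint to every other joint in one frame."""
    c = joints[center]
    return {name: math.dist(c, p) for name, p in joints.items() if name != center}

# One hypothetical frame of joint coordinates, in metres.
frame = {
    "spine_base": (0.0, 0.0, 0.0),
    "hand_right": (0.3, 0.4, 0.0),
    "head": (0.0, 0.8, 0.0),
}
features = relative_distances(frame)
```

Measuring every distance from the spine base makes the feature independent of where the player stands relative to the sensor, since a translation of the whole skeleton cancels out.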

Ball-Movement Pose Recognition Method Based on LSTM-Attention.
Here, Kinect is used to collect the joint point data of the human body during ball sports [24]. Based on these data, two kinds of geometric features are manually designed to describe the ball-sports pose. The LSTM-Attention network model constructed here is trained on the ball-sports training set to obtain appropriate model parameters, and the recognition and classification of ball-sports poses are then carried out on the test set. Its network structure is shown in Figure 5.
In this model, a multilevel LSTM structure is designed to improve the learning ability of the recognition network [25] so that it can handle the complex feature representations in ball-sports poses. The LSTM is designed with three network layers. An attention mechanism is added to the model. This design enables the feature vector to perceive the network weights that significantly impact the recognition results of ball-motion gestures, so that important feature information receives attention. It also further enhances the feature data extracted by the previous network layers. In addition, a dropout layer is added between the LSTM structures, which reduces overfitting of the gesture recognition model when the number of experimental samples is limited. To sum up, feature learning through the multilevel LSTM network is combined with feature enhancement through the attention mechanism. This enables the network model to fully and effectively learn the correlation between the time series data in the ball-motion poses, thereby improving the pose recognition ability of the entire network model. Figure 5 displays a diagram of the LSTM-Attention network model unrolled in time. The features of the human body during ball sports are arranged in temporal order to form an action sequence, which is used as the input of the ball-sports pose recognition network. The input includes an angle feature and a relative distance feature. After the previously mentioned joint evaluation, repair, and feature extraction, the ball-motion features of each human body form 38-dimensional data. In addition, because different actions contain different numbers of frames, the length of each action sequence varies. The sequences in the dataset must therefore be brought to equal length before the feature data are input into the gesture recognition network model.
The other motion sequences are zero-padded to the length of the longest ball-motion sequence.
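The equal-length step can be sketched as follows. Appending the zero frames at the end of each sequence, the function name `pad_sequences`, and the 2-dimensional toy frames (the article uses 38-dimensional features) are assumptions; the text does not say where the zeros are placed.

```python
def pad_sequences(sequences):
    """Zero-pad every sequence of feature frames to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    feature_dim = len(sequences[0][0])  # assumes non-empty sequences of equal frame width
    padded = []
    for seq in sequences:
        missing = max_len - len(seq)
        # Copy the real frames, then append all-zero frames (assumed: padding at the end).
        padded.append([list(frame) for frame in seq] + [[0.0] * feature_dim for _ in range(missing)])
    return padded

# Two hypothetical action sequences with different frame counts.
serve = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
smash = [[0.9, 0.8]]
batch = pad_sequences([serve, smash])
```

After padding, every sequence in `batch` has the same number of frames, so the sequences can be stacked into a single fixed-shape input for the network.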
The multidimensional feature sequence is input into the LSTM-Attention network, where it is processed by LSTM, dropout, and the attention mechanism. The intermediate value is sent to the output layer, which uses the softmax function. This function outputs the probability value of each of the five ball-sports posture labels. Finally, the label with the maximum probability is selected as the output category of the final ball-motion pose.
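The output layer described above can be sketched as a softmax followed by an argmax. The five label names are the actions named in the results section of this article; their order, the function names, and the example scores are assumptions for illustration.

```python
import math

# The five actions named in the results section; the ordering is assumed.
POSE_LABELS = ["serve", "lift", "high clear", "backhand", "smash"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_pose(scores, labels=POSE_LABELS):
    """Return the most probable label together with all class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical output-layer scores for one action sequence.
label, probs = predict_pose([1.2, 0.3, 2.5, -0.4, 0.9])
```

Since softmax preserves the ordering of the scores, the predicted label is the one with the largest raw score; the probabilities are useful mainly for inspecting how confident the model is.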
When the human body performs different ball-motion poses, not all the joint motion data contained in the human skeleton are equally important. For example, when the human body completes a badminton swing, the changes in the joint point data are mainly concentrated in the right arm, and the joint point data of other parts of the body have little effect on the final recognition result. Therefore, the attention mechanism is introduced into the improved human ball-motion pose recognition model. During movement, the important data of human limbs and joints can be marked and given more attention. In the research data here, the coordinate data of fifteen joint points of the human body during ball sports are collected, and the human skeleton model is established on this basis. In most cases, the joint point information associated with a given ball-motion pose is fixed. This fixed joint point information is converted into feature vectors through LSTM. The essence of the attention mechanism is to perform a weighted summation of these feature vectors to find the joint point information that has an important impact on the recognition of ball-motion poses.
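The weighted summation at the heart of the attention mechanism can be sketched as below. Scoring each feature vector by its dot product with a query vector, the name `attention_pool`, and the toy values are assumptions; the article does not specify how the attention scores are computed.

```python
import math

def attention_pool(hidden_states, query):
    """Weighted sum of feature vectors, with softmax-normalized dot-product scores."""
    # Score each feature vector against the query (assumed scoring function).
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    # Context vector: each dimension is the weighted sum over all feature vectors.
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights

# Three hypothetical LSTM feature vectors and a hypothetical query vector.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
context, weights = attention_pool(hidden_states, query)
```

Feature vectors that align with the query receive larger weights and therefore dominate the context vector, which is how informative joints or frames can be emphasized over uninformative ones.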

Results and Discussion
The comparative experiments before and after the coordinate system transformation of joint point information verify how much the joint-point preprocessing method based on the coordinate system transformation improves the accuracy of gesture recognition. First, experiments are conducted on the joint point data before and after the coordinate system transformation using the traditional LSTM pose recognition network model. MATLAB software is used to simulate the gesture recognition process of the network model. The results are shown in Figure 6. Figure 6 shows the comparative experimental results before and after the coordinate information conversion preprocessing of the joint point data through the LSTM neural network. The experimental results indicate that the joint-point coordinate information after the coordinate system transformation significantly improves the recognition accuracy of the five ball-motion poses by at least 0.01. However, the improvement in recognition accuracy for smash actions is smaller than the improvements for serve, lift, high clear, and backhand. This suggests that the recognition accuracy for actions like smash is less affected by angular changes during data collection than that of the other actions.
In addition to LSTM, this section also conducts comparative experiments on gesture recognition methods such as BiLSTM, linear SVM, and multilayer LSTM. The experimental data and settings remain the same as in the above LSTM experiments. BiLSTM is the abbreviation of bidirectional long short-term memory, that is, a bidirectional LSTM neural network. It is composed of a forward LSTM and a backward LSTM, and both are often used to model contextual information in natural language processing tasks. Figure 7 shows the overall prediction results of each gesture recognition method for the joint-point coordinate information before and after the coordinate system conversion.
The experimental results show that the joint-point coordinate information after the coordinate system transformation can effectively improve the overall accuracy of the ball-motion gesture recognition network model. After the coordinate system is converted, the accuracy is improved by at least 0.01, and the accuracy of the method proposed here is higher than that of the other methods.
The multilayer LSTM structure is expected to integrate the feature information of long-term sequences on a global scale and realize a high-level abstraction of the input human skeleton joint point data. Therefore, this article constructs a multilevel LSTM structure based on a classification model. However, if the number of layers in the LSTM multilayer structure is large, the model becomes more complex and takes a long time to converge. Therefore, this article conducts comparative experiments with different numbers of LSTM layers on the BadmintonData and MSRAction3D datasets to verify the soundness and effectiveness of the ball-motion gesture recognition model based on the three-layer LSTM structure designed here. The BadmintonData dataset contains various volleyball poses. The MSRAction3D dataset records 20 actions performed by ten subjects. Each subject performs each action two to three times. There are a total of 567 depth map sequences with a resolution of 640 × 240. The data are recorded with a depth sensor similar to the Kinect unit.
The number of layers of the LSTM multilayer structure is set to one, two, three, four, and five in turn. LSTM-Attention_n denotes a model that fuses an n-layer LSTM structure with the attention mechanism, and LSTM denotes a model that contains only a single-layer LSTM structure without an attention mechanism. The results are shown in Figure 8.
The purpose of the comparative experiment before and after the restoration of joint point information is to verify how much the joint-point processing method based on bone length and motion continuity improves the performance of the pose recognition method. In essence, it evaluates the influence of erroneous joint point data on the accuracy of human pose recognition. This section uses the abovementioned gesture recognition methods to conduct comparative experiments on the BadmintonData and MSRAction3D datasets. In the experiment, all experimental configurations are the same except for the joint-point data repair preprocessing. The two datasets after joint data repair are recorded as BadmintonData_Repairing and MSRAction3D_Repairing, respectively. Then, model generation and prediction are performed on the BadmintonData, MSRAction3D, BadmintonData_Repairing, and MSRAction3D_Repairing datasets using the above four pose recognition methods. The predicted results are shown in Table 2.
In the experiment, the same action poses are recognized in the BadmintonData_Repairing and MSRAction3D_Repairing datasets, and the pose recognition results after the joint repair operation are much higher than those obtained on the data without joint repair. This shows that the processing method based on bone length and motion continuity proposed in this article can improve the final accuracy of the gesture recognition network. Therefore, before the joint-point coordinate information obtained by Kinect is used for action and pose recognition, it is crucial to evaluate its reliability and restore the joint point information that contains errors.

Conclusion
Studying, recognizing, and detecting volleyball postures can provide theoretical guidance and suggestions for players and can be applied in competitions, where it helps judges make sound decisions. At present, pose recognition in ball sports is very challenging in terms of action complexity and research data, and this research also has important application value. This article takes volleyball as the representative and studies volleyball-movement pose recognition by building on an analysis and summary of existing human pose recognition research based on the joint point sequence and the LSTM network. It proposes a data preprocessing method based on angle and relative distance feature enhancement and a volleyball-motion pose recognition model based on LSTM-Attention. The experimental results indicate that the data preprocessing method reported here can further improve the accuracy of gesture recognition. The joint-point coordinate information after the coordinate system transformation significantly improves the recognition accuracy of the five ball-motion poses by at least 0.01. Moreover, the LSTM-Attention recognition model is soundly designed in structure and competitive in gesture recognition performance. However, this method is not yet universal and should be refined in future research.

Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.